Probability, likelihood and Bayes

Day 1

Manuele Bazzichetto

Probability

Meaning? Depends on whom you ask..

Frequentist: Essentially, the (long-run) relative frequency (or proportion) of an event happening

Bayesian: Essentially, the relative plausibility of an event happening given what we already know about what generates events and what we actually observe (i.e., data)


Which one is best?

NEITHER

Both are useful

Probability rules

“Frequentist” or “Bayesian”..probabilities obey rules:

  • Number bounded between 0 and 1 (i.e., \(0\leq Pr \leq1\))
  • Union (mutually exclusive): \(Pr(A \cup B) = Pr(A) + Pr(B)\), if \(Pr(A \cap B) = 0\)

  • Intersection: \(Pr(A \cap B)\), the probability that A and B both occur

  • Union (not mutually exclusive): \(Pr(A \cup B) = Pr(A) + Pr(B) - Pr(A \cap B)\)

  • Joint probability: \(Pr(A \cap B) = Pr(A) \cdot Pr(B)\), if A and B are independent

  • Independence: \(Pr(A|B)=Pr(A)\) and \(Pr(B|A)=Pr(B)\)

  • Conditional probability: \(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}\)

\(\cup\): read it as “probability of either A OR B (or both) occurring”

\(\cap\): read it as “probability of A AND B occurring simultaneously”
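
A quick empirical check of these rules in R, using simulated rolls of a fair die (the events are arbitrary choices for illustration):

```r
set.seed(42)

# 100,000 rolls of a fair six-sided die
rolls <- sample(1:6, size = 1e5, replace = TRUE)

# Event A: the roll is even; event B: the roll is 5 or 6
A <- rolls %% 2 == 0
B <- rolls >= 5

# Union (not mutually exclusive): Pr(A U B) = Pr(A) + Pr(B) - Pr(A n B)
mean(A | B)                       # empirical Pr(A U B), ~2/3
mean(A) + mean(B) - mean(A & B)   # same value via the union rule

# Conditional probability: Pr(A|B) = Pr(A n B) / Pr(B)
mean(A & B) / mean(B)             # ~1/2 = Pr(A): here A and B are independent
```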

Probability rules

Note that, under independence between A and B, \(Pr(A|B)=Pr(A)\), so:

\(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}\\Pr(A)\cdot Pr(B)=Pr(A \cap B)\)


While, under lack of independence between A and B:

\(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}\\Pr(A|B)\cdot Pr(B)=Pr(A \cap B)\)

A note on pdf vs. pmf

Discrete measures 👉 probability

Continuous measures 👉 density


  • The probability for any specific value of a continuous measure is \(0\)

  • Densities are related to (but not exactly the same as) probabilities

  • Both still obey probability rules: pdf(s) integrate to 1; pmf(s) sum to 1
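
A quick check of that last point in R:

```r
# pdf of the standard Gaussian integrates to 1...
integrate(dnorm, lower = -Inf, upper = Inf)   # 1, up to numerical error

# ...even though a density value itself can exceed 1
dnorm(0, mean = 0, sd = 0.1)                  # ~3.99: a density, not a probability

# pmf of a Binomial(size = 10, prob = 0.3) sums to 1
sum(dbinom(0:10, size = 10, prob = 0.3))      # 1
```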

A note on cumulative distribution function(s)

Cdf(s) map each value a measure can assume to the probability of the measure taking that value or a lower one

Usually written as: \(F(x) = Pr(X \leq x)\)

Cdf(s) exist for both continuous and discrete measures

A note on quantiles

Quantiles are values assumed by a measure that split its distribution into two parts: a proportion \(p\) of the probability lies at or below the \(p\)-th quantile

Example: percentiles split a probability distribution into 100 intervals of equal probability

Example: the median is the 2nd quartile (and the 50th percentile)

R makes it easy

The Fantastic 4

d*, p*, q*, r*

d*: compute density (cont.) or probability (discr.)

p*: returns \(Pr(measure\leq quantile)\) (mind the lower.tail argument)

q*: returns the quantile for a given \(Pr(measure\leq quantile)\) (mind the lower.tail argument)

r*: draw random values of measures from a model

Examples:

Gaussian: dnorm, pnorm, qnorm, rnorm

Binomial: dbinom, pbinom, qbinom, rbinom
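
A minimal tour of the four functions for these two models (values quoted in the comments are approximate):

```r
# d*: density of N(0, 1) at 0 (a density, since the measure is continuous)
dnorm(0)                          # ~0.399

# p*: cdf, Pr(X <= 1.96); flip the tail with lower.tail = FALSE
pnorm(1.96)                       # ~0.975
pnorm(1.96, lower.tail = FALSE)   # ~0.025

# q*: the inverse of p*, i.e. the quantile for a given probability
qnorm(0.975)                      # ~1.96
qnorm(0.5)                        # 0: the median (2nd quartile) of N(0, 1)

# r*: random draws from the model
rnorm(5)

# Same logic for a discrete measure, where d* returns a probability
dbinom(3, size = 10, prob = 0.3)  # Pr(Y = 3)
pbinom(3, size = 10, prob = 0.3)  # Pr(Y <= 3)
```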

Data, models, probability


Data: information we have available

Model: a set of assumptions to describe a simplified version of reality

Parametric model: Model described by parameters (see pdf(s) and pmf(s))

Probability: how measures behave according to our model

Likelihood and ML estimation

We have data and models, what do we do now?

Let’s use data to estimate model parameters!


ID BodyMass (g)
1 5085.467
2 4983.132
3 4384.706
4 4773.966
5 5224.501
6 5272.518
7 4467.005
8 4892.681

Assumption: Body mass of (all existing) Gentoo penguins is normally distributed with some mean and variance

Parametric model: \(Gentoo\hspace{1 mm}body\hspace{1 mm}mass \sim \mathcal{N}(\mu,\, \sigma^{2})\)

Likelihood function

  • \(Probability(BodyMass = 3000) \rightarrow Pr(BM_i = value_i)\)
  • \(Pr(BM_1 = value_1) \times Pr(BM_2 = value_2) \hspace{1 mm} \times \hspace{1 mm} ... \hspace{1 mm}\times \hspace{1 mm} Pr(BM_n = value_n)\)
  • \(\prod\limits_{i=1}^{n} Pr(BM_i = value_i)\)
  • \(Likelihood \hspace{1 mm} (L) = \prod\limits_{i=1}^{n} Pr(BM_i = value_i|\mu,\sigma^2), \hspace{1 mm} with \hspace{1 mm} Pr \hspace{1 mm} being \hspace{1 mm} the \hspace{1 mm} Gaussian \hspace{1 mm} pdf\)

Maximizing the joint probability of the data given the parameters finds the parameter value(s) under which observing the data is most plausible (under the assumed model)!


Likelihood(parameters|data) = Probability(data|parameters)
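
A minimal sketch of this product in R, using the eight body-mass values from the table above and two arbitrary candidate parameter combinations (\(\sigma\) fixed at 350 for illustration):

```r
# The eight observed body masses (g) from the table above
bm <- c(5085.467, 4983.132, 4384.706, 4773.966,
        5224.501, 5272.518, 4467.005, 4892.681)

# Likelihood: the product of Gaussian densities, one per observation
lik <- function(mu, sigma) prod(dnorm(bm, mean = mu, sd = sigma))

lik(mu = 4800, sigma = 350)   # a candidate close to the sample mean...
lik(mu = 4000, sigma = 350)   # ...beats a clearly off-target candidate
```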

Maximum likelihood estimation

Link data, model and L

Data: sample of \(n\) penguins on which we measure BM

Model:

\(BM \sim \mathcal{N}(\mu,\,\sigma^{2})\)

Probability (density) for \(BM_i\):

\(f(BM_i) = \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{BM_i-\mu}{\sigma}\right)^2}\)

L (given model):

\(\prod\limits_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{BM_i-\mu}{\sigma}\right)^2}\)

“Move” along combinations of \(\mu\) and \(\sigma^2\) and find those that maximize L
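
A brute-force version of that search, reusing `bm` and `lik()` from the sketch above, with \(\sigma\) held fixed at an arbitrary 350 for simplicity:

```r
# Evaluate the likelihood over a grid of candidate mu values
mu_grid <- seq(4000, 5500, by = 1)
lik_grid <- sapply(mu_grid, function(m) lik(mu = m, sigma = 350))

mu_grid[which.max(lik_grid)]   # the winning candidate...
mean(bm)                       # ...is (approximately) the sample mean, the Gaussian MLE of mu
```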

Maximum likelihood estimation

We usually maximize the log-Likelihood (LL) for two main reasons:

  • Products become sums:

\(log(\prod\limits_{i=1}^{n}X_i) = \sum\limits_{i=1}^{n}log(X_i)\)

  • Easier to work with exponential functions (like lots of pdf(s) and pmf(s)), and the sum avoids numerical underflow (see the sketch below)
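
A quick illustration of both points on simulated data: with this many observations the raw product underflows to zero, while the log-sum stays finite:

```r
set.seed(1)
x <- rnorm(1000, mean = 5, sd = 2)

prod(dnorm(x, mean = 5, sd = 2))             # underflows to 0 for n this large
sum(dnorm(x, mean = 5, sd = 2, log = TRUE))  # the log-likelihood, computed safely
```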

Examples

We will:

  • Estimate the population mean of Gentoo body mass using brute force

  • Estimate regression parameters for the relationship between Gentoo body mass and flipper length (without using brute force)

  • Estimate the rate parameter of a Poisson population

NOW GO TO R..

What I think I am doing


Model:

\(\mu_i = \alpha + \beta \cdot flipper\hspace{1mm}length_i\)

What I am actually doing

Model:

\(Gentoo\hspace{1 mm}body\hspace{1 mm}size_i \sim \mathcal{N}(\mu_i,\, \sigma^{2})\)

\(\mu_i = \alpha + \beta \cdot flipper\hspace{1mm}length_i\)
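
A minimal sketch of fitting this model by ML with `optim()`, using simulated stand-in data (all values hypothetical) rather than real penguins; flipper length is centered only to keep the optimizer well behaved:

```r
set.seed(1)
# Hypothetical stand-in data: flipper length (mm) and body mass (g)
fl <- runif(100, min = 200, max = 230)
bm_sim <- -5000 + 45 * fl + rnorm(100, sd = 300)
fl_c <- fl - mean(fl)   # centered predictor

# Negative log-likelihood of the model BM_i ~ N(alpha + beta * fl_i, sigma^2)
negLL <- function(par) {
  mu <- par[1] + par[2] * fl_c
  -sum(dnorm(bm_sim, mean = mu, sd = exp(par[3]), log = TRUE))  # log-sd keeps sigma > 0
}

fit <- optim(c(mean(bm_sim), 0, log(sd(bm_sim))), negLL)
fit$par[1:2]              # ML estimates of alpha and beta

coef(lm(bm_sim ~ fl_c))   # least squares recovers the same alpha and beta
```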

Poisson

\(Y \sim Pois(\lambda)\), with \(Y\) assuming integer values \(\geq 0\)

Pmf: \(Pr(Y) = \frac{\lambda^Y e^{-\lambda}}{Y!}\)


  • \(Mean = variance\)
  • Limiting case of binomial with \(N\) large and \(p\) small
  • Used to model counts with no known upper bound
  • Converges to Gaussian as \(\lambda\) gets large
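
A short sketch of ML estimation of \(\lambda\) from simulated counts (the true value of 3.2 is arbitrary):

```r
set.seed(7)
y <- rpois(200, lambda = 3.2)   # simulated counts from a Poisson population

# Log-likelihood of a candidate lambda given the counts
pois_LL <- function(lambda) sum(dpois(y, lambda, log = TRUE))

# Maximize over a reasonable interval
optimize(pois_LL, interval = c(0.01, 20), maximum = TRUE)$maximum

mean(y)   # for the Poisson, the MLE of lambda is the sample mean
```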

Things to keep in mind

  • Likelihood function \(\neq\) Pdf: the L is a function of the parameters, with the data held fixed

  • We found the MLE(s). Does this mean that we now know the population parameters? NO!

  • From the shape (curvature) of the LL around its maximum, we can tell how precisely the population parameters are estimated

Bayes’ rule and Bayesian stats

The Bayes’ rule

A re-arrangement of conditional probability:

Conditional probability: \(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}\)

  • \(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}\)

  • \(Pr(A|B)Pr(B)=Pr(A \cap B)\)

  • But \(Pr(A \cap B) = Pr(B \cap A)\)

  • And \(Pr(B \cap A) = Pr(B|A)Pr(A)\)

  • So \(Pr(A|B)Pr(B) = Pr(B|A)Pr(A)\)

  • Dividing both sides by \(Pr(B)\) we end up with:

Bayes’ rule: \(Pr(A|B) = \frac{Pr(B|A)Pr(A)}{Pr(B)}\)
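
A tiny numeric check of the rule, with hypothetical numbers for a diagnostic test (95% sensitivity, 5% false-positive rate, 1% prevalence):

```r
pr_BgivenA    <- 0.95   # Pr(positive | disease): sensitivity
pr_A          <- 0.01   # Pr(disease): prevalence
pr_BgivenNotA <- 0.05   # Pr(positive | no disease): false-positive rate

# Pr(B): total probability of a positive test
pr_B <- pr_BgivenA * pr_A + pr_BgivenNotA * (1 - pr_A)

# Bayes' rule: Pr(disease | positive)
pr_BgivenA * pr_A / pr_B   # ~0.16, despite the accurate-looking test
```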

But what does it mean??

\(Pr\): we are familiar with them (pdf(s), pmf(s))


\(Pr(B|A)\): what if I tell you that \(B\) is data and \(A\) parameters?

WELL DONE! IT’S THE LIKELIHOOD!



\(Pr(A)\): prior..a model for the parameter(s) 🤯

\(Pr(B)\): the marginal probability of the data, the normalizing constant that makes the posterior sum (or integrate) to 1
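
Putting the pieces together: a minimal grid-approximation sketch that reuses the eight body masses in `bm` from the likelihood example, with \(\sigma\) fixed at an arbitrary 350 and a hypothetical Gaussian prior on \(\mu\):

```r
mu_grid <- seq(4000, 5500, by = 1)

# Prior: a (hypothetical) model for the parameter mu
prior <- dnorm(mu_grid, mean = 5000, sd = 500)

# Likelihood of the data at each candidate mu
lik_vals <- sapply(mu_grid, function(m) prod(dnorm(bm, mean = m, sd = 350)))

# Bayes' rule: posterior proportional to likelihood x prior,
# normalized by Pr(data) (here, the sum over the grid)
posterior <- lik_vals * prior / sum(lik_vals * prior)

mu_grid[which.max(posterior)]   # posterior mode, pulled between prior mean and sample mean
```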

Why choose Bayes

  • Not limited to a unique perspective on the data-generating process (DGP)
  • Maximum likelihood crowns one and only one winner; Bayes returns a whole distribution for the parameters
  • Probably better suited for ecology and observational studies - nature is complex
  • ‘Frequentists’ really talk about (repeatable) experiments..
  • Both are useful!!

Estimating parameters